Abstract - In DSP systems, the performance of the multipliers
plays a crucial role in determining the processor's performance.
In this paper, a high-speed multi-precision multiplier with
minimum area and power consumption is proposed. The
multiplier also supports parallel processing, so that
higher-precision multiplications can be performed. The main focus
of this paper is to increase multiplier performance. The speed of
a multiplier depends on the generation of partial products. Here,
compression techniques are suggested to improve multiplier speed.
In addition, supply-voltage scaling and frequency management are
applied. This flexible multiplier, combining variable-precision
processing with voltage and frequency management, can be used to
reduce circuit power consumption and delay. Simulation is done in
ModelSim 6.3f, and power and area are obtained from synthesis in
Xilinx ISE Design Suite 8.1.

Index Terms - DSP; multi precision; parallel processing

I. INTRODUCTION

Multipliers are key components in digital signal processors,
microprocessors, FIR filters, etc. Since multipliers are typically
the slowest elements in the system, the performance of a system
depends on its multipliers. High-precision multipliers also
consume a large amount of area in DSP kits. It is therefore
important to optimize the speed and performance of a multiplier.
The process of multiplication includes the following three steps:
1. Partial products are generated. 2. The partial products are
reduced to one row of final sums and one row of carries. 3. The
final sums and carries are added to generate the result.
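
The three steps can be illustrated with a small software model. The sketch below is only an illustration under stated assumptions, not the circuit proposed in this paper: the function names (`partial_products`, `carry_save_reduce`, `multiply`) are hypothetical, operands are unsigned, and each row is held as a Python integer so that one bitwise expression acts on all columns at once.

```python
def partial_products(u, v, n):
    """Step 1: one shifted copy of u for every set bit of the n-bit v."""
    return [(u << i) if (v >> i) & 1 else 0 for i in range(n)]


def carry_save_reduce(rows):
    """Step 2: apply 3:2 (full-adder) compression until two rows remain."""
    rows = list(rows)
    while len(rows) > 2:
        a, b, c = rows.pop(), rows.pop(), rows.pop()
        sum_row = a ^ b ^ c                             # per-column sum bits
        carry_row = ((a & b) | (b & c) | (a & c)) << 1   # carries move one column left
        rows += [sum_row, carry_row]
    while len(rows) < 2:                                # pad very small inputs
        rows.append(0)
    return rows[0], rows[1]


def multiply(u, v, n):
    """Step 3: one carry-propagate addition of the two remaining rows."""
    row_a, row_b = carry_save_reduce(partial_products(u, v, n))
    return row_a + row_b
```

For example, `multiply(13, 11, 4)` returns 143. No carry is propagated along a row until the final addition, which is why the reduction step can be made fast in hardware.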

Multipliers are typically designed for a fixed maximum word
length to suit the worst case. This results in power loss and
reduces the efficiency of the multiplier. Numerous works have
addressed this word-length optimization. Earlier, word-length
optimization was achieved by routing the incoming operands to the
smallest multiplier that can compute the result.

That method, however, was expensive. Later, reusing the
functional units was introduced as a solution to this problem. A
dramatic reduction in power consumption can be achieved by using
error-tolerant dynamic voltage scaling (DVS) [1]. Error-tolerant
DVS based on razor flip-flops overcomes the limitations of
conventional DVS. Combining a multi-precision (MP) multiplier with
DVS can provide a dramatic reduction in power consumption by
adjusting the voltage according to the circuit's run-time workload
rather than fixing it to cater for the worst case. The main focus
of this paper is to propose a novel method to improve the
multiplier's efficiency. By using 4:2 compressors, it is possible
to optimize the delay introduced by the multiplier.

II. RELATED WORKS 

A. 4 subblock multiplier 

A fixed-width multiplier (FWM) generates an output with the same
width as the input. It is, however, inefficient to perform a
low-precision multiplication on a high-precision multiplier.
Therefore, a multi-precision multiplier was designed.

Let U and V be the 2n-bit multiplicand and multiplier,
respectively. $U_H$ and $V_H$ are their n-bit MSBs, and $U_L$ and
$V_L$ are their n-bit LSBs. The multiplication result can be
expressed by the following equation:

$P = (U_H V_H)\,2^{2n} + (U_H V_L + U_L V_H)\,2^{n} + U_L V_L$   (1)

This equation reveals that the multiplication process requires
four n*n multipliers.

Comparison of this 4-subblock multiplier with a conventional FWM
shows that it incurs overheads of 13% and 18% in power and
silicon area, respectively. This motivated the move to
3-subblock multipliers.
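
Equation (1) can be checked numerically. The sketch below is an illustration under stated assumptions rather than the hardware datapath: `split`, `four_subblock` and `three_subblock` are hypothetical names and the operands are unsigned 2n-bit integers. `three_subblock` uses the standard three-product rearrangement of the middle term, $U_H V_L + U_L V_H = (U_H + U_L)(V_H + V_L) - U_H V_H - U_L V_L$, which is how a design can drop one of the four n*n products.

```python
def split(x, n):
    """Return the n-bit MSB half and the n-bit LSB half of a 2n-bit value."""
    return x >> n, x & ((1 << n) - 1)


def four_subblock(u, v, n):
    """Equation (1): four n*n products combined by shifts and additions."""
    u_h, u_l = split(u, n)
    v_h, v_l = split(v, n)
    return ((u_h * v_h) << (2 * n)) + ((u_h * v_l + u_l * v_h) << n) + u_l * v_l


def three_subblock(u, v, n):
    """Three products: the middle term is recovered by two subtractions."""
    u_h, u_l = split(u, n)
    v_h, v_l = split(v, n)
    high, low = u_h * v_h, u_l * v_l
    middle = (u_h + u_l) * (v_h + v_l) - high - low
    return (high << (2 * n)) + (middle << n) + low
```

With n = 16, both functions agree with the direct product of any two 32-bit operands, for instance `four_subblock(0xDEADBEEF, 0x12345678, 16) == 0xDEADBEEF * 0x12345678`.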


[Figure: the four sub-block products U_H V_H, U_H V_L, U_L V_H and U_L V_L]

Fig. 1. Block diagram of 4 subblock multiplier

B. 3 subblock multiplier

Fig. 2. Block diagram of 3 subblock multiplier


In the 3-subblock multiplier, two auxiliary sums are defined as
follows:

$U_1 = U_H + U_L$

$V_1 = V_H + V_L$

Then the equation for the product can be rewritten as

$P = (U_H V_H)\,2^{2n} + (U_1 V_1 - U_H V_H - U_L V_L)\,2^{n} + U_L V_L$   (2)

From equation (2), it is clear that one n*n multiplier and one
2n-bit adder are replaced by two n-bit adders and two (2n + 2)-bit
subtractors. So, in order to perform a 32-bit multiplication with
16-bit multipliers, only two 34-bit subtractors are additionally
required. This reduces the silicon area and power overhead of the
4-subblock multiplier.

III. PROPOSED ARCHITECTURE

The selection of a multiplication algorithm depends on the
application the multiplier is to serve. Array-based algorithms are
the most commonly used because of their regular structure. An
array multiplier is based on the add-and-shift algorithm, and the
additions can be done with ordinary carry-propagate adders. The
Wallace tree algorithm was proposed to further improve the speed
of the parallel multiplier.

In the Wallace tree architecture, the partial products are
rearranged in a tree-like fashion so that the critical path and
the number of adder cells are reduced. All the bits in each column
of the partial-product matrix are added simultaneously. A set of
counters in each column generates a new matrix of partial
products, and this continues until a matrix of two rows is
obtained. The first row holds the sum bits and the other row holds
the carry bits. The most common counter is the 3:2 counter, which
is a full adder circuit.

In the conventional Wallace tree multiplier, the first step is to
form the partial product array (of $N^2$ bits). In the second
step, groups of three adjacent rows are collected. Each group of
three rows is reduced using full adders and half adders. Full
adders are used in each column where there are three bits, whereas
half adders are used in each column where there are two bits. Any
single bit in a column is passed to the next stage in the same
column without processing. This reduction procedure is repeated in
each successive stage until only two rows remain. In the final
step, the remaining two rows are added using a carry-propagating
adder. In a conventional Wallace multiplier, the number of rows in
subsequent stages can be calculated as:

$r_{i+1} = 2\lfloor r_i / 3 \rfloor + (r_i \bmod 3)$

where $r_i$ is the number of rows in stage i and $r_i \bmod 3$
denotes the smallest non-negative remainder of $r_i / 3$.

The tree multiplier realizes substantial hardware savings for
larger multipliers. The propagation delay is reduced as well. In
fact, it can be shown that the propagation delay through the tree
is O(log(N)).

The Wallace tree algorithm is the one most commonly used in
digital multipliers. The delay generated in a Wallace tree circuit
can be reduced by using approximate compressors. Compressors are
used to accumulate the partial products in the multiplication
process. In this technique, all columns of partial products are
added in parallel without waiting for the carry signal from the
previous column. A conventional full adder can be considered a 3:2
compressor. In a 3:2 compressor, however, the path delay is
irregular. This is due to the presence of two XOR gates in the
circuit.

A. 4:2 compressor 

As a solution to the delay of the 3:2 compressor, a 4:2
compressor was proposed. The 4:2 compressor is built by
connecting two 3:2 compressors in series.



[Figure: 4:2 compressor block diagram; output labels SUM and CARRY]

Fig. 3. Block diagram of 4:2 compressor

Fig. 4. Design of 4:2 compressor

Here, the architecture is connected in such a way that four of
the inputs come from the same bit position, of weight j, while one
bit (Cin) is fed in from position j-1. The outputs of the 4:2
compressor consist of one bit in position j and two bits in
position j+1. Because the output Cout is independent of the input
Cin, the carry-save summation of the partial products is
accelerated.
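
This behaviour can be checked exhaustively in software. The sketch below models the compressor as two full adders in series; the signal names `x1` to `x4`, `cin`, `cout` and the function names are assumptions chosen for illustration, not labels taken from Fig. 4.

```python
from itertools import product


def full_adder(a, b, c):
    """3:2 compressor: returns (sum, carry) for three input bits."""
    return a ^ b ^ c, (a & b) | (b & c) | (a & c)


def compressor_4_2(x1, x2, x3, x4, cin):
    """Two full adders in series; returns (sum, carry, cout)."""
    s1, cout = full_adder(x1, x2, x3)       # cout depends only on x1..x3
    total, carry = full_adder(s1, x4, cin)
    return total, carry, cout


def check_compressor():
    """Verify the arithmetic identity over all 32 input patterns."""
    for x1, x2, x3, x4, cin in product((0, 1), repeat=5):
        s, carry, cout = compressor_4_2(x1, x2, x3, x4, cin)
        # one bit of weight j (s) and two bits of weight j+1 (carry, cout)
        if x1 + x2 + x3 + x4 + cin != s + 2 * (carry + cout):
            return False
        # cout must not change when cin flips
        if cout != compressor_4_2(x1, x2, x3, x4, 1 - cin)[2]:
            return False
    return True
```

`check_compressor()` returns `True`: the five input bits are always represented exactly by the three output bits, and `cout` never depends on `cin`, so no carry ripples along a row of compressors.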

In a conventional Wallace tree multiplier, five partial products
are compressed into four, and so on. A Wallace tree multiplier
modified with 4:2 compressors instead compresses five partial
products into three. This makes it possible to minimize the
irregularity in the delay of a Wallace tree multiplier.
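
The five-to-four versus five-to-three comparison can be reproduced by counting rows. This is a sketch under an assumed grouping policy, not the paper's exact reduction schedule: with 4:2 compressors, each complete group of four rows is taken to become two rows, a leftover group of three is handled by a 3:2 compressor, and one or two leftover rows pass through. All function names are hypothetical.

```python
def rows_after_3_2(r):
    """Conventional stage: each group of 3 rows becomes 2."""
    return 2 * (r // 3) + r % 3


def rows_after_4_2(r):
    """Modified stage: each group of 4 rows becomes 2 (assumed policy)."""
    groups, rest = divmod(r, 4)
    if rest == 3:           # assumption: compress a leftover triple with a 3:2
        rest = 2
    return 2 * groups + rest


def count_stages(r, step):
    """Number of stages needed to reach two rows."""
    stages = 0
    while r > 2:
        r = step(r)
        stages += 1
    return stages
```

`rows_after_3_2(5)` returns 4 while `rows_after_4_2(5)` returns 3. For 16 rows the model gives six stages against three. Stage count alone does not give delay, since the two kinds of stage have different logic depths.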


IV. RESULTS AND DISCUSSIONS

The design is coded in VHDL, a powerful, high-level, concurrent
hardware description language.

A. Simulation Result

ModelSim 6.3f is used to simulate the code. Here x and y are the
inputs and out1 is the output.

Fig. 5. Simulation result of 3 subblock multiplier modified with 4:2
compressor

B. Area and power analysis

Area and power analysis of the FWM, 4-subblock and 3-subblock
multipliers running at 50 MHz is done using Xilinx ISE Design
Suite 8.1.

TABLE I

AREA AND POWER ANALYSIS

Scheme                                             Power (mW)   No. of gates used
32-bit 4-subblock MP multiplier                    134          33,191
32-bit 3-subblock MP multiplier                    104          25,773
32-bit 3-subblock multiplier with 4:2 compressor   86           25,221

V. CONCLUSION

Multipliers are key components in digital circuits. Studies show
that FWMs are very inefficient in DSP processors, so
multiprecision multipliers, which minimize area and power
consumption, are adopted. Further, the speed of a multiprecision
multiplier can be improved by using a modified Wallace algorithm.
The modified algorithm uses a 4:2 compressor. Thus it is possible
to develop a multiprecision multiplier with optimum area, power
and delay.